-
Notifications
You must be signed in to change notification settings - Fork 455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Completed functionality for admin check
command
#5348
base: 3.1
Are you sure you want to change the base?
Conversation
- Implemented SYSTEM_CONFIG check: - Checks ZooKeeper locks for Accumulo server processes - Checks ZooKeeper table nodes - Checks that the WAL metadata in ZooKeeper is valid - Added new SERVER_CONFIG check: - Checks that all configured properties are valid - Checks that required properties are present in the config - Added new tests in `AdminCheckIT` for SYSTEM_CONFIG and SERVER_CONFIG. Added failing test cases for all checks. - Deleted CheckServerConfig (run via `accumulo check-server-config`) as the new `accumulo admin check run SERVER_CONFIG` will inherently do the same checks.
@@ -34,19 +35,29 @@ | |||
import java.util.Set; | |||
import java.util.TreeMap; | |||
import java.util.function.Supplier; | |||
import java.util.regex.Pattern; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment about AdminCheckIT
: considered changing from ConfigurableMacBase to SharedMiniCluserBase for this test from suggestion of @dlmarion. Prior to these new changes, SharedMiniClusterBase was sufficient and much faster, but some of these tests now do damage to the MAC to test for failure cases of the checks (e.g., deleting essential files, messing with metadata), so stuck with ConfigurableMacBase
import com.google.auto.service.AutoService; | ||
|
||
@AutoService(KeywordExecutable.class) | ||
public class CheckServerConfig implements KeywordExecutable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted CheckServerConfig since now this is checked through this SERVER_CONFIG check: the ServerContext object is constructed the same (see ServerUtilOpts) and we end up calling context.getConfiguration() in the SERVER_CONFIG check, so the same checks are done with a few additions. Maybe we don't want to completely get rid of this check yet. Maybe it would be better to have the check just output that the check has moved under accumulo admin check
?
// there are many properties that should be set (default value or user set), identifying them | ||
// all and checking them here is unrealistic. Some property that is not set but is expected | ||
// will likely result in some sort of failure eventually anyway. We will just check a few | ||
// obvious required properties here. | ||
Set<Property> requiredProps = Set.of(Property.INSTANCE_ZK_HOST, Property.INSTANCE_ZK_TIMEOUT, | ||
Property.INSTANCE_SECRET, Property.INSTANCE_VOLUMES, Property.GENERAL_THREADPOOL_SIZE, | ||
Property.GENERAL_DELEGATION_TOKEN_LIFETIME, | ||
Property.GENERAL_DELEGATION_TOKEN_UPDATE_INTERVAL, Property.GENERAL_IDLE_PROCESS_INTERVAL, | ||
Property.GENERAL_LOW_MEM_DETECTOR_INTERVAL, Property.GENERAL_LOW_MEM_DETECTOR_THRESHOLD, | ||
Property.GENERAL_PROCESS_BIND_ADDRESS, Property.GENERAL_SERVER_LOCK_VERIFICATION_INTERVAL, | ||
Property.MANAGER_CLIENTPORT, Property.TSERV_CLIENTPORT, Property.GC_CYCLE_START, | ||
Property.GC_CYCLE_DELAY, Property.GC_PORT, Property.MONITOR_PORT, Property.TABLE_MAJC_RATIO, | ||
Property.TABLE_SPLIT_THRESHOLD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any other important ones I left out? Any I shouldn't have included?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a change for this PR, but wondering if this should be pushed to the validation code of each property. For example could we attempt to do something like the following in addition to the code that goes through the defined props above. Thinking if a props validation fails on null or empty string then its "required" and should be set. Looking at some of the important props, like instance.volumes
their types would need change from something besides STRING that is more specific, which would be a good general change to make (would be good to have specific type to do validation for instance volumes and that could include not accepting empty string).
for(var prop : Property.values()) {
var value = config.get(prop);
if (!Property.isValidProperty(prop.getKey(), value)) {
log.warn("Invalid property (key={} val={}) found in the config", prop, value);
}
}
If the rest of the code worked like this, then would not need this list here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the code is currently written it loops over all the props the loop in the prev comment may not work well because the get method replaces w/ the default value when not present.
In general it seems like it would be best to move the concept of a required property into the Property class in some form. Then the entire system could react appropriately when a required property is not present and is requested. For now a list in this class seems fine.
I experimented w/ validating the volume prop in #5365 based on the exploration done as part of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a good idea to me. There are a lot of PropertyType.STRING
when a String isn't really what the property is. PropertyType
exists to check that the property is valid; setting the PropertyType
to STRING
is just a way to ignore this validation. I wonder if it would be best for a 1 to 1 mapping Property
to PropertyType
. This would probably be overkill though, another idea could be to analyze the PropertyType
s that are always valid. From briefly looking, PropertyType.PATH
, PropertyType.STRING
, PropertyType.URI
are always valid. I don't think any properties should always be valid. Those that are PropertyType.STRING
could probably be given a more appropriate existing PropertyType
or a new one. PATH
and URI
could have validation.
In addition to this, can analyze all properties, determine if they are required or not, and change the validation:
- Properties that are not required could accept empty string, null, or a valid value (where validity is well defined)
- Properties that are required could only accept a valid value
Like you said, for this PR, can just push this list of required properties into Property
. Maybe for now/in this PR this list of required props is only accessed/checked in this admin check
. Might be a bit of a scope creep to start checking this list elsewhere in the code. Could do it in follow on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely want to avoid any scope creep in this PR. Identified some areas that need improvement based on this work, we can open follow on issues or PRs for those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could leave the list as is in the PR. For follow on issues, do we need two issues? One for addressing the STRING types and another for somehow representing and documenting required props in the Property.java?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either would be fine, but it might be easier as just one issue. There would be overlap in these changes so might be hard to split up/work on as two separate issues/PRs. For example, instance.volumes
would need to be a required property (which would be tied to validation in it's PropertyType
) and would need to be moved away from PropertyType.STRING
public Set<ColumnFQ> requiredColFQs() { | ||
return Set.of(MetadataSchema.TabletsSection.TabletColumnFamily.PREV_ROW_COLUMN, | ||
MetadataSchema.TabletsSection.ServerColumnFamily.DIRECTORY_COLUMN, | ||
MetadataSchema.TabletsSection.ServerColumnFamily.TIME_COLUMN, | ||
MetadataSchema.TabletsSection.ServerColumnFamily.LOCK_COLUMN); | ||
MetadataSchema.TabletsSection.ServerColumnFamily.TIME_COLUMN); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the required col FQs and col fams in RootTableCheckRunner, RootMetadataCheckRunner, and MetadataTableCheckRunner look correct now? If these look good, since these are all equivalent now, can push these up to no longer override and return the same thing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah those look good.
log.trace("Checking if the WAL path {} found in the metadata exists in HDFS...", | ||
parseRes.getSecond()); | ||
if (!context.getVolumeManager().exists(parseRes.getSecond())) { | ||
log.warn("WAL metadata for tserver {} references a WAL that does not exist", | ||
tserverInstanceAtWals); | ||
status = Admin.CheckCommand.CheckStatus.FAILED; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if this part might is a bit much to be checking here
public enum ServerProcess { | ||
MANAGER, GC, TSERVER, COMPACTION_COORDINATOR, COMPACTOR, MONITOR, SCAN_SERVER | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there already an existing enum? Considered ServiceLockData.ThriftService but that isn't quite what I'm looking for (missing a type for ScanServer and Monitor for example). ServerType in accumulo-minicluster is what I'm looking for but wouldn't make sense to use a minicluster class here. Is there an enum containing all the types of servers an instance may be running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 4.0 have this type
accumulo/core/src/main/java/org/apache/accumulo/core/client/admin/servers/ServerId.java
Line 39 in 121758a
public enum Type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything in 3.x and 4.x?
// check that essential server processes have a ZK lock failing otherwise | ||
// check that nonessential server processes have a ZK lock only if they are running. If they are | ||
// not running, alerts the user that the process is not running which may or may not be expected | ||
for (ServerProcess proc : serverProcesses) { | ||
log.trace("Looking for {} lock(s)...", proc); | ||
switch (proc) { | ||
case MANAGER: | ||
// essential process | ||
status = checkLock(zkRoot + Constants.ZMANAGER_LOCK, proc, true, zs, status); | ||
break; | ||
case GC: | ||
// essential process | ||
status = checkLock(zkRoot + Constants.ZGC_LOCK, proc, true, zs, status); | ||
break; | ||
case TSERVER: | ||
// essential process(es) | ||
final var tservers = TabletMetadata.getLiveTServers(context); | ||
if (tservers.isEmpty()) { | ||
log.warn("Did not find any running tablet servers!"); | ||
status = Admin.CheckCommand.CheckStatus.FAILED; | ||
} | ||
break; | ||
case COMPACTION_COORDINATOR: | ||
// nonessential process | ||
status = checkLock(zkRoot + Constants.ZCOORDINATOR_LOCK, proc, false, zs, status); | ||
break; | ||
case COMPACTOR: | ||
// nonessential process(es) | ||
if (compactors.isEmpty()) { | ||
log.debug("No compactors appear to be running... This may or may not be expected"); | ||
} | ||
for (String compactor : compactors) { | ||
// for each running compactor, ensure a zk lock exists for it | ||
boolean checkedLock = false; | ||
String compactorQueuesPath = zkRoot + Constants.ZCOMPACTORS; | ||
var compactorQueues = zrw.getChildren(compactorQueuesPath); | ||
// find the queue the compactor is in | ||
for (var queue : compactorQueues) { | ||
String compactorQueuePath = compactorQueuesPath + "/" + queue; | ||
String lockPath = compactorQueuePath + "/" + compactor; | ||
if (zrw.exists(lockPath)) { | ||
status = checkLock(lockPath, proc, true, zs, status); | ||
checkedLock = true; | ||
break; | ||
} | ||
} | ||
if (!checkedLock) { | ||
log.warn("Did not find a ZooKeeper lock for the compactor {}!", compactor); | ||
status = Admin.CheckCommand.CheckStatus.FAILED; | ||
} | ||
} | ||
break; | ||
case MONITOR: | ||
// nonessential process | ||
status = checkLock(zkRoot + Constants.ZMONITOR_LOCK, proc, false, zs, status); | ||
break; | ||
case SCAN_SERVER: | ||
// nonessential process(es) | ||
if (sservers.isEmpty()) { | ||
log.debug("No scan servers appear to be running... This may or may not be expected"); | ||
} | ||
for (String sserver : sservers) { | ||
status = | ||
checkLock(zkRoot + Constants.ZSSERVERS + "/" + sserver, proc, true, zs, status); | ||
} | ||
break; | ||
default: | ||
throw new IllegalStateException("Unhandled case: " + proc); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine some of the non essential processes here in 3.x are essential in 4.x. I believe COMPACTION_COORDINATOR, COMPACTOR, SCAN_SERVER are all essential in 4.x? Is this correct? Will need to change this when merging into 4.x if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah in 4.0 compactors are essential, that is the only way compaction can happen in 4.0. Scan servers are still optional in 4.0. The compaction coordinator is merged into the manager in 4.0.
...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java
Outdated
Show resolved
Hide resolved
if (!tserverInstances.equals(seenTServerInstancesAtWals)) { | ||
log.warn( | ||
"Expected WAL metadata in ZooKeeper for all tservers. tservers={} tservers seen storing WAL metadata={}", | ||
tserverInstances, seenTServerInstancesAtWals); | ||
status = Admin.CheckCommand.CheckStatus.FAILED; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we expect this? If no writes have occurred yet for the TServer, this check would fail, but I imagine in any normal scenario this check is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would expect tablet servers to always have a WAL, even if nothing is written. There is a time period during tserver startup where it has not yet created its first WAL after it has create its lock in ZK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, what should be done about this check? Could leave as is, remove entirely, retry for a period of time before failing, something else?
admin check
commandadmin check
command
private static Admin.CheckCommand.CheckStatus checkZKLocks(ServerContext context, | ||
Admin.CheckCommand.CheckStatus status) throws Exception { | ||
final ServerProcess[] serverProcesses = ServerProcess.values(); | ||
final String zkRoot = context.getZooKeeperRoot(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting now that if #5350 is merged before this PR, then the ZooKeeper root when using ServerContext will be moved to after ZROOT+instanceId
. When that happens, the uses of zkRoot
and getZooKeeperRoot()
here will need to be omitted.
You can disregard this if this PR happens to be merged before #5350 though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks for letting me know. I've been keeping that PR in mind in relation to this one.
New checks for
admin check
commandAdminCheckIT
for SYSTEM_CONFIG and SERVER_CONFIG. Added failing test cases for all checks.admin check run
command functionality #4892. This describes all the checks thus far for the newaccumulo admin check
command. See this for the complete functionality of the command.MetadataConstraints.validateDataFileMetadata
, check thatstf.getPath()
exists in HDFS). This would make sure the file exists before the mutation is ever written and would also check that it exists when we run the admin check command for metadata (root metadata, root table, and metadata table): seeMetadataCheckRunner.checkColumns()
. However, this led to a ComprehensiveIT test failure, so I assume the file may not always exist before the metadata for it is written...accumulo check-server-config
) as the newaccumulo admin check run SERVER_CONFIG
will inherently do the same checks. Maybe we don't want to completely get rid of this check yet. Maybe it would be better to have the check just output that the check has moved underaccumulo admin check
?This PR along with #4957 closes #4892